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Gene regulatory networks typically have low in-degrees, whereby any given gene is regulated by 
few of the genes in the network. They also tend to have broad distributions for the out-degree. 
What mechanisms might be responsible for these degree distributions? Starting with an accepted 
framework of the binding of transcription factors to DNA, we consider a simple model of gene 
regulatory dynamics. There, we show that selection for a target expression pattern leads to the 
emergence of minimum connectivities compatible with the selective constraint. As a consequence, 
these gene networks have low in-degree, and "functionality" is parsimonious, i.e., is concentrated on 
a sparse number of interactions as measured for instance by their essentiality. Furthermore, we find 
that mutations of the transcription factors drive the networks to have broad out-degrees. Finally, 
these classes of models are evolvable, i.e., significantly different genotypes can emerge gradually 
under mutation-selection balance. 

PACS numbers: 87.16.Yc, 87.18.Cf, 87.17.Aa 



I. INTRODUCTION 

It is by now well established that complex organisms 
typically do not have many more genes than less com- 
plex ones. Because of this, the paradigm for thinking 
about biological complexity has shifted from the number 
of genes to the way they work together: higher complex- 
ities might be associated with a greater proportion of 
regulatory genes. In particular, there are strong indica- 
tions in eukaryotes and prokaryotes that for increasing 
genome size the number of regulatory genes grows faster 
than linearly in the total number of genes [l[ 0] • Hence 
it is appropriate to consider biological complexity in the 
framework of interaction networks. This shift from com- 
ponents to the associated interactions has received in- 
creasing attention in many scientific communities, with 
applications ranging from network biology to sociology. 
The relevance of this conceptual framework for biology 
has been repeatedly emphasized and has benefited from 
inputs from other fields and from statistical physics in 
particular [3, 0] . We will therefore freely use the network 
terminology, refering to nodes, their degrees, distinguish- 
ing between in and out-degrees etc. 

From studies that strive to unravel gene regulatory net- 
works (GRN), several qualitative properties transpire: (i) 
a given gene is generally influenced by a "small" number 
of other genes (low in-degree of the network of interac- 
tions when compared to the largest possible degree [!,§); 
(ii) some genes are very pleiotropic (the out-degree of 
some nodes of the network can be high, leading to de- 
gree distribution with fat tails or even possibly scale-free 
behavior |,@); (in) GRN seem to be robust to change 
(e.g., to environmental fluctuations or to mutations), a 
feature that is also found at many other levels of bio- 



logical organisation |7HlO|. A simple way to build ro- 
bustness into a network is to have rather dense connec- 
tions, effectively incorporating redundancy, either locally 
or globally. Furthermore, the number of networks having 
m interactions grows very quickly with m. Thus when 
modeling GRN, the network realizations that perform a 
given regulatory function are dominantly of very high de- 
gree. However this is not the case experimentally, at least 
with respect to the in-degree, and so models so far have 
had to build-in limitations to the accessible connectivi- 
ties [TTl - [l3j . In this work we show that such shortcom- 
ings of models are overcome if one takes into account the 
known mechanisms underlying genetic interactions: gene 
regulation is mediated via the molecular recognition of 
DNA motifs by transcription factors, and this leads to 
biophysical constraints on interaction strengths. Within 
this relatively realistic framework, we shall see that net- 
works under mutation-selection balance: (i) are driven 
to be parsimonious (the essential interactions are sparse) 
for the in-degree; (ii) can have broad distributions for 
the out-degree which is unconstrained; (iii) can evolve to 
very different realizations while preserving their function. 

We begin by explaining the mechanisms incorporated 
into the model, in particular the determinants of the in- 
teractions. We follow standard practice [l3 - [l6| when 
modeling interactions between DNA binding sites and 
transcription factors (TF): the affinity depends on the 
mismatch between two character strings. We also spec- 
ify how gene expression dynamics depend on these inter- 
actions and what "function" the networks must imple- 
ment. The main difference between our approach and 
previous work is the introduction of a molecular defini- 
tion of genotypes; this more realistic setting means that 
mutations are no longer ad-hoc and interestingly this dif- 
ference leads to all the generic properties listed above. 
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In fact, we keep our model as simple as possible to drive 
home the fact that all these properties emerge quite in- 
evitably once such a setting is used. Clearly, our choice 
of a "minimal" model means that we must focus on qual- 
itative aspects of the problem without attempting to re- 
produce specific experimental data. 

After giving the general framework, we present some of 
the mathematical and computational tools we use to ana- 
lyze the model. First, we demonstrate that selection (the 
constraint of having a given function or "phenotype") 
makes the networks' connectivity be close to minimal, 
i.e., networks are as sparse as they can be subject to 
maintaining their function. The "spontaneous" appear- 
ance of sparsity is due to the fact only a tiny (negligible) 
fraction of functional genotypes correspond to non-sparse 
networks. This entropic property is in strong contrast to 
what happens in models formulated in terms of networks 
(rather than in terms of microscopic genotypes). A con- 
sequence of this sparsity is that the genotypes are quite 
robust to mutations [ol [T7l|: only those few binding sites 
that are "effectively" used are fragile, mutations of the 
other (little used) binding sites have almost no effect. 
Thus robustness to mutational changes is very high for 
most binding sites while the "essential" interactions have 
much lower robustness; robustness is heterogeneously dis- 
tributed in the network. Second, we find that the network 
out-degree is unconstrained, but that under evolutionary 
pressure coming from mutation-selection balance, broad 
out-degree distributions are favored. Implications of our 
work are developed in the discussion; in particular, a con- 
sequence is that redundant interactions are rapidly elim- 
inated under evolution if no new function arises which 
might change the selection pressure. 



II. THE MODEL 
A. Model genealogy 

Gene regulatory networks play an essential role in cel- 
lular dynamics. A question of major interest is how cells 
maintain their integrity; this includes stabilizing gene ex- 
pression in the presence of environmental or genetic per- 
turbations p^ . actively maintaining oscillatory dynam- 
ics as in circadian clocks [l^, or responding transiently 
and in a timely way to external signals In the ma- 
jority of studies tackling these issues theoretically, one 
attempts to adjust model or kinetic parameters in gene 
circuits to obtain a GRN with the desired properties. 
Our focus here is different: we ask how functionality 
constrains network structure. Answering this requires 
considering the "space" of all functional GRN and de- 
termining the generic properties arising in this space. In 
particular, we shall be interested in the topological fea- 
tures and the robustness of these GRN. 

The use of a space of GRN to propose "design prin- 
ciples" of gene regulation has been exploited previously 
by several groups (9, 11]. Such a paradigm enables one 



to put forward certain generic behaviors, and it is often 
complementary to other approaches. Perhaps the most 
widespread such framework was made popular by Stuart 
Kauffman (see [ll| and references therein) using random 
boolean networks. Each node of the network is associated 
with a gene and its expression level which, by assump- 
tion, is a boolean variable (1/0 for on/off) representing 
the possible two extreme expression levels of the gene. 
The gene expression dynamics at each node is taken as a 
random boolean function of its inputs; these inputs are 
the gene expression levels of the nodes it is directly con- 
nected to. Gene expression patterns or "phenotypes" are 
then represented by iV-dimensional vectors with boolean 
components. Among other subjects, this framework was 
used to study a number of generic properties arising in 
this ensemble such as robustness to noise and mutations, 
evolutionary paths in classes of fitness landscapes and 
consequences for network topology. 

Kauffman's approach led to the creation of entire 
familes of boolean models. Further steps were made 
by people trying to get additional insight by putting 
more knowledge into the boolean functions. Many works 
did this by focusing on "threshold" networks which had 
already been widely used in neural network modeling. 
There, the random boolean functions for each node are 
replaced by an "integrate and fire" type relation. To this 
end, one introduces an N x N matrix of inter-gene inter- 
actions [2l| where Wtj is the strength of the interaction 
of the j-th gene on the z-th one. (Notice that this ma- 
trix can be regarded as representing a directed weighted 
graph.) Typically these models remain boolean, so that 
the "output" is "on" if and only if the incoming (inte- 
grated) signal is above a threshold; the framework can 
then be considered as a refinement of Kauffman's. Mod- 
els of this class have also been very popular, and have 
been used to study robustness, evolvability and other 
"generic" properties p3 - l26| . However, their main short- 
coming is that the Wij weights are introduced in an ad- 
hoc way, with the consequence that any evolution of these 
networks is also arbitrary. In particular, one can deplore 
that these weights are in no way connected to known fea- 
tures of the underlying microscopic processes giving rise 
to genetic interactions. 

Although we conceptually borrow much from the work 
of our predecessors, we also depart considerably from 
these earlier studies: we abandon the boolean approx- 
imation, we adapt known thermodynamical considera- 
tions to formulate our transcriptional dynamics, and we 
make use of the present knowledge of molecular processes 
to introduce in a realistic fashion interaction weights Wij 
and their dependence on mutations. 



B. Genotypes 

As already mentioned, GRNs are often represented by 
a directed weighted graph, i.e., hy an N x N matrix 
{Wij}, where N is the number of genes belonging to 



3 



the GRN. The {W^j} matrix can a priori be quite ar- 
bitrary. In complete generality, the products of the N 
genes can have regulatory influences on the same set of 
genes (retroaction) and possibly also have some "down- 
stream" consequences on other genes that do not code for 
TF. However the consequences of these last effects can be 
ignored for our purposes since they lead to no feedback 
on the N "core" genes. The focus of all this paper is thus 
on such a core GRN, containing only TF coding genes. 
Note that these restricted networks have similar statisti- 
cal properties to the unrestricted ones, namely sparsity, 
very low values for the in-degree, and a fat tail for the 
out-degree distribution [6J. 

Since it is the affinity between the transcription factor 
j and its binding site (a DNA sequence) in the regu- 
latory region of gene i that microscopically controls the 
level of transcription, we want Wij to be a measure of this 
affinity. An affinity depends on the binding free energy 
which itself determines the frequency with which these 
two molecules will be bound rather than unbound. Fol- 
lowing standard practice, we represent each TF as well 
as each binding site by a character string of length L\ we 
also impose these characters to belong to a 4 letter alpha- 
bet in direct analogy with the four bases of DNA. The 
list of these strings then defines the molecular genotype 
of the system under consideration. 

How can one connect this molecular genotype to a set 
of interactions Wij in the GRN? Building on the well 
known work of Berg and von Hippel ^14.] . we assume 
for the sake of simplicity that the free energy of one 
TF molecule bound to its target is, up to an additive 
constant, equal to edij, where dij is the number of mis- 
matches between the strings representing the j-th TF 
and its binding site in the regulatory region of gene i. 
The parameter e is the penalty for each mismatch, i.e., 
the contribution to binding free energy in units of ksT 
where fc^ is the Boltzmann constant and T is the temper- 
ature in degrees Kelvin. Experimentally, e is inferred to 
have values between one and three if one thinks of each 
base pair of the DNA as being represented by one charac- 
ter |27h29| | . Note also that comparing to the typical num- 
ber of base pairs found in experimentally studied binding 
sites leads to 10 < L < 15. Given this framework, we now 
define the "interaction strengths" Wij arising in such a 
GRN via the Boltzmann factor 

W,j =e-'''^^/Z (1) 

where Z is a normalization (actually a partition func- 
tion). If there were just one transcription factor molecule 
of type j, Wij would be the probability to find that 
molecule bound onto its binding site in the regulatory 
region of gene i. Gerland et al. [l^ have shown that 
the term Z in Eq. ([T]) is in practice close to 1 and that 
the probability of finding a particular transcription factor 
molecule bound at a given binding site is low. 

If we consider the space of all genotypes, that is all 
possible character strings in our model with equal prob- 
ability, the a priori distribution (coming from random 



strings) of the mismatch d for any given pair («, j) is bi- 
nomial: 

pW= Q)(l/4)^-'^(3/4)'^. (2) 

Then for the biologically realistic values of L mentioned 
above, using for instance Stirling's formula, we have 

p{d) < 1 '^hen d <^ L . (3) 

Hence small mismatches are very improbable, a fact that 
will be of utmost importance later on. 

C. Occupation of the binding sites 

Suppose, that there are rij TF molecules of type j that 
can bind to a site in gene i's regulatory region; given 
that this site can be occupied only by one TF molecule 
at a time, it is necessary to take into account the possible 
competition. Using the fact that Z in Eq. ([T]) is close to 1, 
it is possible to approximate the occupation probability 
of the binding site by (T6| : 



l + \/{n,W,j)- 

In the following, we denote by n the maximum num- 
ber of TF molecules that can arise when a gene is fully 
"on" , taking for simplicity this maximum to be indepen- 
dent of j. Biologically, n is known to have a wide range, 
going from of order unity to many thousands. The lower 
value comes from some measurements of the multiplic- 
ity of transcripts [s^, [3l[ , while the higher value comes 
from other measurements of the numbers of transcription 
factor molecules (32|. For our study, we consider several 
values of n, reporting on the range 10 < n < 10"*. We also 
considered smaller values for n and found little change, 
but when n is of order 1, our "mean field" framework 
which neglects fluctuations can no longer be justified. 

Setting n to be j-independent is a rather strong as- 
sumption. It is made here for the sake of simplicity, to 
avoid a proliferation of free parameters. The implication 
is that we abandon any attempt to introduce external 
control factors. For example, it is known that the prob- 
ability of a CAP molecule to attach to DNA and recruit 
a polymerase depends not only on the number of these 
molecules but on the concentration of glucose [s^l- By 
the same token, including co-factors or repressors or mod- 
eling more refined feedback circuits is beyond the scope 
of this paper. 

D. Phenotypes 

Call Sj{t) the normalized level of gene j's prod- 
ucts at time t, corresponding to a total number of TF 
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molecules (sj] equal to nSj{t). As already mentioned, 
we abandon the boolean approximation whereby a gene's 
expression level is either at its minimum or its maximum. 
Instead, we allow all intermediate values making Sj (t) a 
continuous variable. In practice, we take all the Sj{t) 
to lie in the interval [0, 1], corresponding to a minimum 
number of TF molecules of a given type equal to and a 
maximum number equal to n. A short remark is in order 
for readers nourished with boolean models to avoid con- 
fusion: of course, instantaneously, a gene is either being 
transcribed (on) or not (off). However, these transcripts 
have to be translated into proteins (TFs) and these tran- 
scription factors will then have a certain half-life. Thus, 
it is necessary to distinguish the "effective" or average 
level of expression of a gene j (here the number of TF 
molecules of type j present) from the instantaneous tran- 
scription rate. In a model like this one, only the effective 
expression level is relevant. 

At any given time, there is a vector of effective ex- 
pression levels. In the next subsection we will define a 
deterministic dynamical system to model the time evolu- 
tion of this vector. After possible transient behavior, at 
large times the set of expression levels {Sj}j=i,,,,N may 
go to a (time-independent) steady state; we refer to this 
final vector as the "phenotype" of the GRN. It is also 
possible for the dynamics to go into a cycle (periodic 
behavior), but we shall not consider those GRNs in our 
analysis. Note that which case arises may depend on the 
initial expression levels. 

GRNs enable the homeostasis of gene expression, al- 
low for a response to a stimulus, or realize a new func- 
tion such as cellular differentiation. In this work we con- 
sider that the "function" of our GRN is to bring the 
gene expression to a given pattern and maintain it there. 
This choice is motivated from cases arising in early 
embryo development; there, the initial expression levels, 
{5'j™***"'''}j=i^...Ar, are given (inherited during the for- 
mation of the egg). Then a network will "perform the 
desired function" if and only if, starting with the ini- 
tial expression pattern, the vector of expression levels 
converges to a steady state which is close enough to a 
"target" pattern, which again must be given a priori. 
We wiU denote these target levels by {5'j-*°'^^'^*^}j=i,...Ar. 
Hereafter we loosely refer to a network as "viable" if it 
statisfies well this functional property, i.e., if its pheno- 
type is sufficiently close to the target one. 

E. Expression pattern dynamics 

The dynamics of the Sj{t) takes place on biochemical 
time scales (typically minutes). To model these dynam- 
ics, we consider that the normalized level Si{t+1) of the 
i-th gene at time is strongly associated with its tran- 
scription rate at time t. Clearly, that transcription rate 
depends on the degree to which the gene's regulatory re- 
gion is occupied by TFs. To find the probability that a 
TF of type j is bound to gene z's regulatory region, we 



use Eq. but where the term nj is replaced by nSj (t) : 

P.,, = + h/W.,,S,{t)) . (5) 

Here we have introduced h = 1/n which plays the role of 
an effective threshold for the action of Wij . This helps in 
comparing our results with those obtained with threshold 
models (including those from ref. [35|). 

To keep the framework simple, we shall suppose that 
gene i's transcription is "on" whenever at least one TF 
is bound within its regulatory region and otherwise it 
is "off". This is reminiscent of an "OR" logical gate 
whereby the output is on if and only if at least one of the 
inputs is on. The (normalized) mean expression level of a 
gene is then identified with the probability that transcrip- 
tion is on. Taking the TF occupancies in the regulatory 
region to be statistically independent, we then have 

S,it + l) = l-ll{l-P,,) (6) 
j 

Eqs. and ([5]) define our discrete time transcriptional 
dynamics. 

Two qualitative remarks are in order. First, we have 
taken the occupation probabilities of the different TFs 
to be independent; our framework thus ignores collective 
binding effects and in particular interactions between TF. 
Second, we have restricted the transcriptional logic to be 
of the "OR" type. Our TF thus act only as enhancers, 
never as repressors, and they do not cooperate using more 
complicated logic [36j . Since repressors are widespread in 
GRN, we have looked into generalizations of our model: 
we allowed repressor interactions while at the same time 
forcing multiple target expression patterns for a GRN to 
be viable. The resulting GRN are still sparse and their 
out-degree distribution still has a fat tail. However, since 
this paper's goal is to show how sparsity and high out- 
degrees emerge within minimal hypotheses, we postpone 
presenting these generalizations to a later work. 

One can also ask whether our model is structurally ro- 
bust. Any real expression dynamics will include fluctua- 
tions in the number of TF molecules [3l|, [13, whereas 
our Si{t) dynamics are mean-field-like and thus neglect 
fiuctuations. In general such an approximation can be 
justified when the number of molecules involved is large, 
i.e., in our case when n is large. However, it is rarely 
clear a priori what is "large enough" . Thus we have in- 
vestigated the effects of adding noise to Eq. ([B]), letting 
the number of molecules nSi(t) become a Poisson random 
variable whose mean is given by that equation. Interest- 
ingly, none of the properties of the resulting GRN are 
significantly modified, even when going down to n = 10. 
This can be understood qualitatively by considering the 
robustness of Eq. ([5]) . in that equation will be affected 
by noise in Sj{t), but much less than by the modification 
of Wij induced by changing the mismatch dij by one unit. 
One can then expect that high level of noise in the ex- 
pression dynamics is unable to compensate the relevant 
mismatches present in the genotype. 
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FIG. 1: Sciiematic representation of the regulatory region of 
gene i: there are A'^ binding sites, each labeled by an index 
i (1 ^ i ^ N). Represented is the interaction Wij mediated 
by the binding of TF j to the j'th site of that region. The 
binding affinities depend on the mismatch between the string 
of length L representing the TF and that representing the 
DNA of the corresponding binding site. 



III. IMPLEMENTATION 

To minimize the complexity of the model, we consider 
that each gene's regulatory region consists of N putative 
binding sites, one for each of the TV types of TFs as illus- 
trated in Fig.[T] If gene's j normalized expression level is 
Sj , it will produce a certain number nSj of TF molecules 
of type j. 

The genotype of one of our networks consists of many 
strings of length L containing characters from a four let- 
ter alphabet. Explicitly, for each of the N genes, one 
string encodes the TF and N strings encode the DNA 
sequences of the different binding sites in the regulatory 
region of that gene. There is thus a total of N{N + 1)L 
characters specifying each genotype. This in fact defines 
the (discrete) space of all genotypes. However we are 
not interested in all genotypes, we want those that have 
"good" phenotypes. This leads us to define quantita- 
tively a fitness for each genotype as follows. 

Let us first define a "distance" between two phenotypes 
S and S' using the differences in expression levels for each 
gene: 



(7) 



Algorithmically, we consider that the dynamical process 
Eq. © has reached a fixed point if D{S{t + l),S{t)) < 
10~^. Given a phenotype, we take its fitness to be 



F{S) = exp {~fD{S, S(*"''9'=*))) 



(8) 



where / is a control parameter. A simple argument 
indicates that / should be quite large: the maximum 
expression of gene i (i.e., Si = 1) corresponds to the 
production of n -I- 0{y/n) TF molecules if one allows 
for Poisson noise. (This noise is unavoidable: diffusion 
as well as other processes necessarily lead to statistical 
fluctuations.) The relative fluctuation in Si is there- 
fore 0(I/-\/n). Such a fluctuation must be innocuous 
for the cell and must not generate signiflcant loss in fit- 
ness. Hence, one should roughly have ] j \fn of order 1 or 



less, and thus / — 0{y/n) or less. In our simulations we 
use f — 20; the results depend only very weakly on this 
choice provided / is in the range 10 to 100. 



Given this definition of fitness, we can sample the fit- 
ness landscape of our system by Monte Carlo importance 
sampling. The goal is to bypass genotypes of low fitness 
and to focus instead on those genotypes whose phenotype 
is close to the target one. A standard approach to do 
this is to use the fitness as the measure to be used in the 
sampling: each genotype will appear with a probability 
proportional to its fitness. We do this by the Metropo- 
lis algorithm, producing a (biased) random walk in the 
fitness landscape which visits successive genotypes. At 
each step of the algorithm, we go to a neighboring geno- 
type (in practice this is done by changing one character in 
the strings defining the genotype). Then the phenotype 
and fitness of this modified genotype is determined; the 
modified genotype is accepted or not, and the process is 
repeated. The acceptance procedure uses the Metropolis 
rule: if the fitness has increased, the modified genotype is 
accepted; if the fitness has decreased, the modified geno- 
type is accepted with a probability given by the ratio of 
the new and old fitnesses. After many such steps, the al- 
gorithm will sample the space with the specified measure. 
In the next section, we shall consider the case where the 
TF character strings are fixed; the sampling of genotypes 
then restricts the modifications to arise only in the reg- 
ulatory regions. In that case, we use "sweeps" , a sweep 
being a succession of LN'^ attempted modifications. 



For simplicity, we shall choose the {5*, 



(initial) 



}] = !,. ..N 



and {5f'"^^'*^}j=i,...7v to be "on" (1) or "off" (0). If 

(initial) , ^(tarqet) , , , , i 

' " ' g^j-g drawn at random, the num- 
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ber of components set to 1 will be approximately equal 
to that set to when N is large. For the results pre- 
sented here, these numbers are set to N/2 exactly as this 
reduces finite size effects in N. One can of course also 
use other choices for these numbers. To check the robust- 
ness of our model's properties, we have examined how the 
network connectivity is affected when we take S(*'^''9'=*) to 
have iV/4 or 3N/A components set to 1 instead of the N/2 
used throughout this paper. Interestingly, in all cases, we 
find that the same qualitative properties emerge, namely 
that for each gene that is "on" in S^*'^''^^*) there is one 
incoming essential interaction (as defined later), while 
the out-degree distribution of the network has a fat tail. 
Coming back to our present choice motivated by drawing 
the expression vectors at random, we see that because of 
the permutation symmetry of the model, one can always 



swap the indices so that S,^ 



(initial) 



1 for J < N/2 and 



otherwise; furthermore we also impose without loss of 



generality Si 



(target) 



1 for iV/4 < i < 37V/4 and other- 



wise. Notice that J2i ^, 

EQ(initial) Q(target) 
i "^i — 



(initial) 

7V/4. 



{target) 



N/2 and 



IV. BINDING SITE MUTATIONS AND 
SPARSITY 

A. The emergence of a viable genotype 

We start by considering the TF encodings as given and 
fixed. The justification for this comes from the fact that 
because TF are typicaUy pleiotropic, they are subject 
to strong stabihzing selection; they are thus generally 
thought to evolve slowly, while DNA regulatory regions 
typically have a high level of polymorphism and evolve 
more quickly [s^. In this and the next subsection, we 
adopt, as a first approximation, the assumption that the 
genes (and thus the TF they code) are fixed whereas the 
strings of characters representing the DNA binding sites 
are unconstrained. A consideration of the effects of TF 
mutations is postponed to the following section. 

As already mentioned, the genotype determines the 
list of weights Wij , corresponding to a weighted oriented 
graph. The first problem we address is the emergence 
of a viable genotype starting from an entirely random 
genotype. This can also be done by taking the Wij's 
using mismatches generated from the binomial distribu- 
tion Eq. ([2]). Random genotypes will produce very poor 
(unfit) phenotypes, but if the system is allowed to evolve 
under a selection pressure on the fitness, then it is possi- 
ble that the genotypes will undergo gradual improvement 
until they are viable (highly fit). In our system, we thus 
ask for a steady state pattern S and subject the genotypes 
to a selection, for instance a selection proportional to the 
fitness as given in Eq. ([5]). If such selection is done on 
small populations, the resulting dynamics is quite similar 
to what arises in the Metropolis random walk algorithm, 
so we have implemented this in silico evolution process 
using the Monte Carlo algorithm described in the previ- 
ous section. 

In effect, identifying each of the attempted changes in 
our algorithm as a mutation, we examined the change 
in the fitness with increasing number of mutations. The 
result of this simulation is shown in Fig. [21 where we plot 
the fitness (actually the distance £)(S, S(*''''f'=*))) versus 
the number of mutational sweeps for two typical realiza- 
tions. It is instructive to follow such a trajectory in more 
detail. Let us rewrite Pij{t) {cf. Eq. ([5])) as 

P,,{t)^S,{t)/{S,{t)+x.,), (9) 

where Xij = h/Wij. In a random system, the Wij's are 
grouped around the average value 

{W,j) = (0.25 -I- 0.75 e-")^ . (10) 

When i = 12 and e — 2, this quantity is fairly small, 
about 3.6 X 10^^. Hence, for h > lO^'', all Xij have large 
values. Consequently, for the initial (random) genotypes, 
Eq. (jni) yields Si ^ on the l.h.s., and this is of course a 
steady state expression pattern. Since, by construction, 
g(target) j^g^g jy^2 elements equal to 1, all other being 0, 
the distance is then D « N/2. Such a situation may 
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500 1000 

#sweeps 

FIG. 2; Building up the stationary expression pattern S to 
^{target) ^ Starting with a random genotype. Shown are two 
typical trajectories as a function of the number of mutational 
sweeps. Here = 20, L = 12, e = 2.0 and h = 0.01. 

persist for a rather long time, until as the result of con- 
tinuing mutations an unlikely event occurs: a diagonal 
element of the array Xij , say Xkk , becomes small enough 
by chance. Then, one of the dynamical equations basi- 
cally reduces to 

Sk{t + 1) « Sk{t)/{Sk{t) + Xkk) , (11) 

with a non-trivial stationary solution 5*^ « 1 — Xkk ■ For 
Xkk — h cxp{edkk) to be 0(1) or smaller, dkk must be less 
than dfi = [ln{l/h)/e], a value we call "critical" here- 
after. For the parameter range under consideration, the 
generic result is that Sk approaches values close to 1 and 
a relatively strong interaction appears. Such an interac- 
tion - which will turn out to be "essential" as explained 
soon - is characterized by a subcritical mismatch. In this 
situation, the distance D to the target phenotype drops 
by an amount of order 0(1). Notice, that a subsequent 
significant increase of this distance is highly improbable 
because of the selection pressure: one unit of increase 
is counter-selected by a factor exp(— y^) from what was 
explained in the previous section. 

As the process continues with additional mutations, 
the increase in fitness dramatically accelerates. This is 
because the presence of every essential interaction has 
a double effect: it renders more probable the formation 
of new essential interactions which no longer need to be 
located on the diagonal, and it tends to lift up simultane- 
ously several Si's. The distance to the target phenotype 
then rapidly drops step by step until the system reaches 
a regime where D fluctuates around a plateau value. For 
the choice of parameter values used in Fig. [51 this leads to 
Si ~ 0.94 for each i satisfying — while the re- 

maining expression levels are negligibly small. From this 
"evolution" experiment, we see that our model allows 
for evolvable genotypes, in the sense that a new function 
can be acquired under realistic selection pressures if given 
enough time. 
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It should be emphasized that a large auto-regulation 
Wkk appears as the first noticeable evolutionary event 
(when a diagonal mismatch becomes small enough by 
chance). While such a direct auto- regulation is neces- 
sary to initiate the turning on of genes in our simulation 
of evolvability, it is not crucial later. We have thus mea- 
sured the fraction of large auto-regulatory interactions at 
long times (when the Monte Carlo provides equilibrium 
samples of viable) to see if such interactions remain fa- 
vored. The answer is no: the fraction of auto-regulatory 
interactions is no larger than expected by chance. 

We should also mention that once a viable genotype 
has been found, in practice we find it to be viable also 
when using other choices for g^™***"') other than S = 0. 
(Thus the convergence of Eqs. (jH]) and (O to the fixed 
point is largely independent of the starting expression 
pattern.) It seems that without introducing repressor 
interactions (which is beyond the scope of this paper), it 
is not possible to condition viability on g in our 

model. 




FIG. 3: Distribution of the Hamming distance between a TF 
and the receiving DNA site for iV = 20,L = 12,e = 2.0 
and for various values of the threshold parameter: (diamond) 
h — 0.1, (square) h = 0.01, (circle) h = 0.001, (triangle up) 
h = 0.0001, (triangle down) binomial. The lines are to guide 
the eye. 



B. Sparse essential interactions 

Remaining in the framework of having the TF encod- 
ings fixed, the Monte Carlo evolutionary dynamics just 
described will generate at large times an equilibrium dis- 
tribution of fit (viable) genotypes. Recall that, because 
of the choice of our acceptance criterion, the algorithm 
will sample genotypes with a frequency proportional to 
their fitness. Hereafter we shall call this the space of vi- 
able genotypes. Viable genotypes define in essence a null 
hypothesis since that space determines the distribution 
of any CRN property arising from the sole constraint of 
function, i.e., the only constraint imposed on the CRN is 
that they be fit. We will now see that the vast majority 
of these CRN are in fact sparse, a feature that does not 
at all arise in CRN models that ignore the microscopic 
origin of the Wij . 

For our computational study, as suggested in Fig. [21 
it is best to perform measurements every few hundred 
sweeps so that the data collected are not overly corre- 
lated. We typically used 10000 measurements separated 
by 100 sweeps. In Fig. [3] we show the distribution of the 
mismatch d when there are iV = 20 genes, for increas- 
ing values of the "threshold" parameter h. At the low 
values oi h, d has a binomial distribution with a peak 
near d = 3L/4 as expected. However at small d one ob- 
serves significant deviations. In fact, as h increases, the 
viability constraint becomes more marked and the dis- 
tribution becomes bimodal: a second peak appears at 
low mismatch values. Note that this peak shifts as h in- 
creases, indicating that there are nearly perfect matches 
that appear in that regime. 

Two remarks are in order: for intermediate values of 
h it may happen that the peak at small d extends over 
two neighbor values of d; the strong interactions then 
do not necessarily have the same strength. (Note that 
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FIG. 4: Table of mismatches dij between the j-th TF and 
its binding site in the regulatory region of gene i for a toy 
GRN with N = 8 (other parameters are as in Fig. [2|). The 
mismatches corresponding to essential interactions are shown 
in boldface. In virtue of Eq. the strength of the essential 
interactions is « 0.135 while the other interaction strengths 
are negligible (at most of order 10~^). On the r.h.s. we have 
drawn the network, keeping essential interactions only. The 
level of expression of the genes that are active in the target 
state is « 0.926 and that of the inactive ones ranges between 
3 X 10"*' and 6 x 10"*. 



because the d take on discrete values, so do the Wij.) 
Also, one could argue that the fitness parameter / should 
be scaled like ^/h when h is changed. We have carried out 
calculations with such a rescaling but this change does 
not modify substantially the overall picture. 

In Fig. U we show an illustrative case of the mis- 
matches for a viable network with iV = 8, e = 2.0 and 
h — 0.01. The bimodal nature of the mismatch distri- 
bution is clearly apparent; note in particular that the 
distribution at large d is broad while that at small d is 
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narrow. Recalling that Wij = exp(— edy ), the interac- 
tion matrix can be considered to be sparse if we focus 
only on strong interactions. However doing so requires 
defining rather arbitrarily a distinction between strong 
and not strong Wij. This difficulty is inherent to our 
framework where Wij is never 0, in contrast to the situ- 
ation in other gene network models where an interaction 
is present or absent. 

Since reality does not follow a "present/absent" di- 
chotomy for interactions, it is desirable to find a way to 
define mathematically a notion of sparsity in our context 
without too much arbitrariness. For this, let us consider 
not the weights themselves but their functionality. For 
a given genotype, call an interaction Wij "essential" if 
when setting it to zero viability is lost. With this defini- 
tion, one can ask whether viable genotypes have sparse 
essential interactions, and in particular whether there are 
few essential interactions per row of the matrix {W^j}. 
Notice that the number of active rows {i.e., having not 
too small expression levels) must be equal to the number 
of genes that are "on" in the target phenotype, N/2 in 
our case, otherwise the fitness is far too low. We find 
that as soon as h is not too small, there is almost always 
just one essential interaction per row as shown in Fig. [5] 
for = 20 and L — 12. The same result holds for other 
relevant values of N, n and L, suggesting that within 
our model, the drive towards sparse interactions arises in 
regimes of biological relevance. 

We also considered a stronger measure of essentiality: 
we asked that viability be lost when the interaction's mis- 
match is increased by one. Remarkably, the rule "one 
essential incoming interaction per gene" generally held 
here too. Thus mutations in these interactions are typi- 
cally deleterious, while mutations in the vast majority of 
the other interactions have no consequence on the fitness. 
This shows that mutational robustness is very heteroge- 
neously distributed among the interactions in the net- 
work. The average robustness of fitness with respect to 
binding site mutations, Rbs, is readily estimated: there 
are N/2 sensitive interactions out of N'^, hence one ex- 
pects Rbs ^ 1-1/27V. And indeed we find Rbs = 0.977(2) 
for iV = 20 (with some weak dependence on h in the third 
decimal). 

A consequence of the above is that in each active row 
of a viable network the mismatch distribution is (semi- 
quantitatively) well represented by the ansatz {H{x) = 1 
for a; > and otherwise): 



p{d) 



N ~ 1 
N 



p{d) 



1 p{d)H{dh ~ d) 



(12) 



with a very simple interpretation: the shape of the prob- 
ability distribution of any mismatch is essentially that 
without the viability constraint, but with an additional 
peak at small values of the mismatch. There is thus one 
"leading" mismatch taking care of most of the viability 
constraint, while the other mismatches behave approxi- 
mately as if they were unconstrained. 
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FIG. 5: Probability distribution of the number of essential 
interactions per row of the matrix specifying a viable network 
for = 20, L = 12,e = 2.0 and a range of values of h. 



It is interesting to see how the essential interactions 
are distributed among the columns of {W^y }■ First of all, 
they may appear in N/2 columns only (in our example). 
This is an immediate consequence of the target pheno- 
type being a fixed point of the dynamical process Eq. ^ . 
A simulation yields the result that, as one might expect, 
the iV/2 essential interactions are distributed completely 
randomly among the N/2 active columns. Since the num- 
ber of essential interactions in a column is a measure 
of the effective out-degree of the corresponding node of 
the GRN (in the theory of weighted networks one would 
rather refer to the strength of the network node), one 
readily convinces oneself that the out-degree distribu- 
tion is binomial with probability parameter equal to 2 /N . 
More generally, let K be the number of active genes in 
^{target) _ Viable GRN win typically have K essential re- 
actions, all connecting these active genes. Furthermore, 
these essential interactions define a subnetwork where the 
in-degree is almost always 1 while the out-degree has a 
binomial distribution that is well approximated by a Pois- 
son law of mean 1. 

A long succession of binding site mutations can move 
an essential interaction from one column to another. A 
few hundred sweeps are sufficient for that. Measuring the 
moments of the out-degree distribution one can calculate 
the autocorrelation time of this process. It turns out to 
be roughly of the order of 10"^ sweeps for h = 0.01 and 
about 10^ sweeps for h = 0.001. This implies that very 
different genotypes can arise under mutation-selection 
balance if given enough time; our GRN model thus al- 
lows for gradual change of genotypes while maintaining 
phenotypes, a characteristic of evolvability. 



V. TF MUTATIONS AND A POPULATION 
DYNAMICS APPROXIMATION 

Up to now we assumed that the strings of characters 
associated with TFs were fixed. Of course, the genes 
coding for TFs can mutate. These mutations have very 
specific consequences for the GRN: when a TF is modified 
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in our model, a whole column of the matrix {VFij} is 
affected at once. There will thus be a selective pressure 
on networks depending on how their essential interactions 
are spread across different columns, and this will change 
the out-degree distribution. 

Qualitatively, one expects broad distributions to be fa- 
vored; indeed, consider mutating a TF coding gene. If 
the TF is associated with at least one essential interac- 
tion, then such a mutation is typically highly deleteri- 
ous. Oppositely, if the TF has no essential interactions, 
the mutations will be innocuous. Hence, in the presence 
of TF mutations, there is a strong correlation between 
the robustness Rtf to mutations in the different TFs 
and column "occupancy" by essential interactions {i.e., 
the essential out-degree, to be denoted kout)- Let K be 
the total number of essential interactions (typically equal 
to N/2 in the results presented here), and let N occupied 
{N empty) be the number of occupied (empty) columns. Of 
course Noccupied + Ncmpty — N. Since mutating essential 
interactions is highly deleterious, the average robustness 
with respect to mutating a TF coding gene is accurately 
given by Rtf — Ncmpty /N and it is easy to convince one- 
self that (kout) = K /Noccupied where the average (•) is to 
be taken only over occupied coUumns. After elementary 
algebra one gets 



Rtf = 1 



K 



N{ko 



(13) 



In the examples discussed in this paper, K = N/2, and 
therefore 



2(fco«t) 



(14) 



a result fully confirmed by an explicit numerical simu- 
lation. Thus, GRN with smaller kout are disfavoured as 
soon as we allow TFs to mutate. 

To reveal the effects of selection, we must work with 
a population of GRN under mutation-selection balance 
and see the networks with broad out-degree distributions 
out-compete those with more narrow distributions. A 
large scale simulation of such a population would pro- 
vide the distribution of kout- Here we choose a simpli- 
fied approach based on effective evolutionary dynamics 
as follows. First, one can safely ignore all the inactive 
genes in s(*i'"9et)^ g^j^^j g^j-g jgf^ with K different TFs. 
Second, for each of these K genes, there will almost al- 
ways be exactly one essential (incoming) interaction. We 
thus map each microscopic genotype to an effective geno- 
type, specifying only the essential interactions, as can 
be justified from the decoupling of Eq. ([T^ . Third, we 
consider a population of such effective genotypes under- 
going mutation and selection. At the microscopic level, 
we showed in the previous section that mutations of the 
binding sites will lead to slow changes in the positions 
of the essential interactions. These changes arise at the 
time scale r which empirically we found to be several 
hundred sweeps, each sweep corresponding to LN'^ mu- 
tations. That time scale r is very large for the population 



dynamics under consideration so for simplicity think of 
l/r as being infinitesimal. Forth, at each generation we 
evolve the population of (effective) genotypes: the indi- 
viduals are in competition because mutations in the TF 
coding genes affect them differently. The relevant charac- 
teristic of each genotype is its mutational robustness Rt / 
to mutations in these genes. Thus Rtf acts as a fitness 
which is under selection, with large Rtf being favored; 
in view of Eq. pij) . this should tend to enhance large 
out-degrees. 

To see how this transpires within a population genet- 
ics framework, we consider an effective model where an 
essential interaction is represented by a "ball" , a column 
of {Wij} by a "box" and an individual (e.g. a cell) in the 
population by K boxes. The population of such "cells" 
is studied numerically simulating the well-known Moran 
process [l^] as follows. At each step, we take a random 
individual of the population; let Noccupied be its number 
of genes with out-degree greater or equal to 1. A selec- 
tion process is applied to this individual. With probabil- 



ity Rtf ^l- N 



occupied /K it is duplicated and another 
randomly chosen individual is removed from the popula- 
tion. Otherwise (and thus with probability Noccupied/ K), 
one tries another random individual, and so forth. With 
this Moran process, the population size stays constant 
as the fitness rises and reaches a limit. Higher fitness 
(robustness Rtf) corresponds via Eq. ([HI) to a broader 
out-degree distribution, large degrees being favored. 

The simulation of this simplified population model is 
much faster than if we were to use the microscopic geno- 
types. It is also far more intuitive: each effective geno- 
type can be thought of as a way to put K balls into K 
boxes. Each box is a TF and the balls in that box give 
the number of essential interactions associated with that 
TF. In Fig. [6] we display the resulting out-degree distri- 
bution for K — 10, 20 and 40. The initial condition on 
the simulation is that the balls are distributed at random 
among the K boxes. At this point a few comments are 
in order: 

- For the choice l/r = 0, the result of the Moran 
process may depend on the initial condition. Since, in 
our case, the distribution of essential interactions among 
columns of {VFij} generated by binding site and TF mu- 
tations is different, one may worry that the results shown 
in Fig. |6] change if one mixes individuals with binomial 
and flat ball distributions. We have found that mixing 
them in equal proportion leads to an out-degree distribu- 
tion that has almost the same shape as in Fig. El except 
that it is shifted to the left by about 0.5, a minor modi- 
fication. 

- In reality, r is not infinite, so at each generation, one 
should allow, at the rate l/r, balls to go from one box to 
another. We checked explicitly that hops of balls done 
with frequency 1/10000, still much larger than is realistic, 
do not change the result for the out-degree distribution 
in a perceptible way. 
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FIG. 6; Out-degree distribution within the simphfied geno- 
types (see text) when using the Moran process. The pop- 
ulation consists of 100000 genotypes and K = 10: squares, 
K = 20: triangles up, K = 40: triangles down (circles refer 
to the binomial distribution of mean 1, which for these values 
of K is almost independent). 



VI. DISCUSSION AND CONCLUSION 

We considered a quite simple model of a gene regula- 
tory network (GRN) in which function is identified with 
reaching given target gene expression levels. The key fea- 
ture which sets our framework apart is the introduction 
of molecular genotypes which thereby specify interaction 
weights in the GRN. Within this setting, we have shown 
how a "viable" (i.e., having the desired function) GRN is 
progressively constructed under selection pressure, when 
one forces the phenotype to increase its "fitness" by ap- 
proaching a previously defined optimal phenotype. We 
investigated the properties of networks in this model un- 
der the constraint that they be viable. We find that for 
a certain range of values of the model's parameters, the 
viability constraint leads to sparse GRN; we have quanti- 
fied this through the sparsity of "essential" interactions. 
Interestingly, the effects of the viability constraint con- 
dense onto just a few of the interactions, the others being 
non-functional. As a result, nearly all mutations of the 
binding sites have no effect on the viability and so such 
sites have a very high mutational robustness. However, 
for those few sites which bear the essential interactions, 
the majority of mutations are deleterious so their mu- 
tational robustness is low. Thus in our GRN, the mu- 
tational robustness is extremely heterogeneous from site 
to site. In addition, any "redundant" interaction is ex- 
pected to become lost under evolutionary dynamics since 
mutations will remove it and condense the burden of vi- 
ability onto a smaller number of interactions. We have 
also studied the consequences of TF mutations for the 
fate of a population. In contrast to the behavior of the 
in-degree distribution which is "as narrow as possible" , 
we find that the distribution of out-degrees develops a 
fat tail. 
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FIG. 7: The curves show the lower limit of the region where 
h = 107V[0.25 + Q.7bexp{~£)\^ when iV = 20. 



Although our modeling involves certain idealizations, 
its main characteristics are fairly realistic; in particu- 
lar we have insisted on including interactions through 
the biophysical mechanism of molecular recognition and 
affinity. It is therefore quite striking that a reasonable 
GRN topology comes out very naturally in this frame- 
work. It should be clear that this success is a result of 
the combined effect of several causes: the viability con- 
straint, the low probability of a small mismatch between 
TF and the binding site on the DNA, the size L of this 
segment, the not-too-small spacing (in units of fc^T) be- 
tween the energy levels that determine the strength of 
TF-DNA interactions, and finally the value of the thresh- 
old parameter h itself which enters the dynamics of the 
gene expression levels (c/. Eq. ([6])). 

This overall picture corresponds to having all factors 
Pij in Eq. ^ be rather small except for the one which is 
associated with the essential incoming interaction. Such 
a scenario arises when the threshold h is significantly 
larger than the sum of random Wij entering these factors 
(behaving, in practice, as if the viability constraint were 
absent). For a four letter code, assuming, as we do, that 
the binding energy is additive in mismatches and that 
every mismatch costs the same, one gets the condition 



/i>7V(0.25 + 0.75e"^)^ 



(15) 



Note that within a two letter code, the condition forces 
one to larger values of L (close to 20) and thus beyond 
what is realistic biologically. Since the model parameters 
correspond to measurable quanties, it is appropriate to 
compare to biological values. According to Eq. @ the 
probability that a TF occupies a DNA site is controlled 
hy h — \/n, where n is the number of these molecules. 
We considered the range 1/10000 < h < 1/10. Interpret- 
ing ^ as "larger by one order of magnitude", i.e., by a 
factor 10, one gets an allowed region in parameter space 
as illustrated in Fig. [T] 

This figure can be used to obtain the predicted domain 
of relevance: it is above the corresponding curves (taken 
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at = 20 and illustrative values of h). We see that e and 
L should not be too small. Moreover, it is gratifying that 
the experimental range of these parameters (indicated by 
a rectangle) is near the border and, most of it, within this 
region. The model would remain meaningful if L and e 
were even larger. However, in the analogue of Fig. |31 
the point at d = would dominate strongly over the few 
neighboring d points, and the GRN would be robust but 
not as evolvable [4l|, [l^l . It is worth emphasizing that as 
the number N of genes grows, it is necessary to increase 
slowly either L, e oi h. e is constrained by biophysical 
processes and thus not evolvable, and L seems the best 
candidate for the system to adapt to increasing N [43| . 
Note nevertheless that the effects of growing A'' are mild 
and that in practice regulation is modular, so effectively 
biological GRN have only modest values of N. 
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